ggml : add SSM Metal kernels #8546

Merged: 2 commits into master on Aug 26, 2024

Conversation

@ggerganov (Owner) commented on Jul 17, 2024

ref #6758

Straightforward Metal implementation of SSM_CONV and SSM_SCAN using single-threaded kernels, mimicking the CPU implementation. There is lots of room for further optimization; for now the focus is on ensuring correctness.
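
For reference, SSM_CONV applies a short per-channel causal convolution over a rolling window of recent inputs, and SSM_SCAN evaluates the selective state-space recurrence. The C sketch below shows roughly what the two ops compute for a single token; the function names, signatures, and memory layouts are illustrative only and do not correspond to the actual ggml/Metal kernel code.

#include <math.h>

/* Per-channel causal convolution (SSM_CONV-like), one token at a time.
   cx holds the last d_conv inputs for each of the d_inner channels. */
static void ssm_conv_token(int d_inner, int d_conv,
                           const float * cx,  /* [d_inner * d_conv] rolling input window */
                           const float * w,   /* [d_inner * d_conv] per-channel weights  */
                           float       * x) { /* [d_inner]          convolved output     */
    for (int i = 0; i < d_inner; ++i) {
        float acc = 0.0f;
        for (int j = 0; j < d_conv; ++j) {
            acc += cx[i*d_conv + j] * w[i*d_conv + j];
        }
        x[i] = acc;
    }
}

/* Selective scan recurrence (SSM_SCAN-like), one token at a time.
   The hidden state s is updated in place; y receives the per-channel output. */
static void ssm_scan_token(int d_inner, int d_state,
                           const float * x,   /* [d_inner]           input for this token     */
                           const float * dt,  /* [d_inner]           step size (pre-softplus) */
                           const float * A,   /* [d_inner * d_state] state transition         */
                           const float * B,   /* [d_state]           input projection         */
                           const float * C,   /* [d_state]           output projection        */
                           float       * s,   /* [d_inner * d_state] hidden state (in/out)    */
                           float       * y) { /* [d_inner]           output for this token    */
    for (int i = 0; i < d_inner; ++i) {
        /* softplus keeps the discretization step positive */
        const float dt_sp = dt[i] <= 20.0f ? log1pf(expf(dt[i])) : dt[i];
        const float x_dt  = x[i] * dt_sp;
        float acc = 0.0f;
        for (int j = 0; j < d_state; ++j) {
            const int k = i*d_state + j;
            /* discretized recurrence: s = s * exp(dt*A) + dt*B*x */
            const float state = s[k] * expf(dt_sp * A[k]) + B[j] * x_dt;
            s[k] = state;
            /* y = dot(state, C) */
            acc += state * C[j];
        }
        y[i] = acc;
    }
}

The single-threaded Metal kernels in this PR essentially run this style of loop on the GPU, which is why there is still plenty of headroom for parallelizing across channels and state dimensions.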

./llama-batched \
  -m ./models/mamba-130m/ggml-model-f16.gguf \
  -p "Hello, my name is" -np 16 -n 32
main: n_predict = 32, n_ctx = 448, n_batch = 32, n_parallel = 16, n_kv_req = 437

Hello, my name is

main: generating 16 sequences ...

main: stream 0 finished at n_cur = 32
main: stream 1 finished at n_cur = 32
main: stream 2 finished at n_cur = 32
main: stream 3 finished at n_cur = 32
main: stream 4 finished at n_cur = 32
main: stream 5 finished at n_cur = 32
main: stream 6 finished at n_cur = 32
main: stream 7 finished at n_cur = 32
main: stream 8 finished at n_cur = 32
main: stream 9 finished at n_cur = 32
main: stream 10 finished at n_cur = 32
main: stream 11 finished at n_cur = 32
main: stream 12 finished at n_cur = 32
main: stream 13 finished at n_cur = 32
main: stream 14 finished at n_cur = 32
main: stream 15 finished at n_cur = 32

sequence 0:

Hello, my name is Tiffany. I'm a mother of three and a retired teacher. I'm a member of the American Indian and Alaska Native (AI

sequence 1:

Hello, my name is John. I am a freelance writer and editor. I have a passion for writing and have been writing since I was a child. I

sequence 2:

Hello, my name is Renee. I'm a full-time writer, and I'm currently working on a new book. I'm also a graduate

sequence 3:

Hello, my name is Jules. I'm a writer and illustrator. I have a passion for the arts and I love to travel. I love to

sequence 4:

Hello, my name is Renee. I am a single mom of two boys. I am trying to figure out how to make this work. I am

sequence 5:

Hello, my name is Dr. Sonia. I'm a doctor in the University of Medicine and Dentistry of New Jersey. I'm here to help you

sequence 6:

Hello, my name is Nick. I'm a member of the
  National Association of Women in the United States of America. I'm
  a member

sequence 7:

Hello, my name is Jadine. I'm a real person, and I'm here to help you. I'm here to help you get the best

sequence 8:

Hello, my name is Roxane and I'm a young woman with a love of all things chocolate. I've been a member of the Chocolate Club for

sequence 9:

Hello, my name is John. I'm a professional musician, and I'm looking for a new job. I'm a musician, and I'm looking for

sequence 10:

Hello, my name is Dr. Paul, and I'm a doctor in the area of cardiac surgery. I'm here to help you. I'm here to

sequence 11:

Hello, my name is Daniel and I'm a teacher in an elementary school in the United States. I've been reading about the dangers of the internet for the

sequence 12:

Hello, my name is Sven, and I'm a member of the Sven-Gustavsson Foundation. I'm here to talk about the future

sequence 13:

Hello, my name is Nico, I'm a professional photographer, I work in the studio of the famous photographer, Josef Krammer, who is

sequence 14:

Hello, my name is John. I'm a big fan of your work. I'm looking for a job. I'm looking for a good, honest man

sequence 15:

Hello, my name is John. I'm a newbie to the Internet, and I'm trying to learn how to use it.
I'm trying to

main: decoded 432 tokens in 0.71 s, speed: 609.55 t/s

llama_print_timings:        load time =     137.83 ms
llama_print_timings:      sample time =      10.18 ms /   448 runs   (    0.02 ms per token, 44025.16 tokens per second)
llama_print_timings: prompt eval time =     727.16 ms /   437 tokens (    1.66 ms per token,   600.97 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =     845.80 ms /   438 tokens

ggml_metal_free: deallocating

./llama-perplexity \
  -m ./models/mamba-130m/ggml-model-f16.gguf \
  -f build/wikitext-2-raw/wiki.test.raw -ngl 99
perplexity: tokenizing the input ..
perplexity: tokenization took 950.02 ms
perplexity: calculating perplexity over 650 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 0.55 seconds per pass - ETA 1.48 minutes
...
Final estimate: PPL = 25.0894 +/- 0.18559

@ggerganov changed the title from "llama : advanced batch splits" to "ggml : add SSM Metal kernels" on Jul 17, 2024
@github-actions bot added the "testing" label (Everything test related) on Jul 17, 2024
@github-actions bot added the "ggml" label (changes relating to the ggml tensor library for machine learning) on Jul 18, 2024
@ggerganov marked this pull request as ready for review on July 18, 2024 at 12:51
@mofosyne added the "Review Complexity : High" label (Generally requires in-depth knowledge of LLMs or GPUs) on Jul 19, 2024
@ggerganov changed the base branch from compilade/batch-splits to master on August 26, 2024 at 09:26
@ggerganov merged commit fc18425 into master on Aug 26, 2024
8 checks passed
@ggerganov deleted the gg/metal-ssm branch on August 26, 2024 at 14:55
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Aug 27, 2024
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
* ggml : add ggml_ssm_conv metal impl

* ggml : add ssm_scan metal impl

ggml-ci
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024